Statistical Learning from Multiple Information Sources
Abstract
In intelligent information processing tasks such as pattern recognition and information retrieval (IR), probabilistic models are now widely used because they can represent ambiguity in observed data and are robust to noise. The parameters of a probabilistic model are statistically estimated (learned) from given training data. However, when the training data contain too little information, the learned model becomes unreliable and its performance deteriorates severely. This thesis proposes two novel learning algorithms that use multiple information sources to mitigate this information scarcity problem in the following two applications.

The first application is classification, in which optimal class labels are automatically assigned to observations whose class labels are unknown. Among the various types of classification problems, this thesis considers the classification of sequences, which consist of sequential observation points. As a classifier, we focus on the hidden Markov model (HMM), which has been widely used for sequence classification. Generally, an HMM is trained on labeled data that consist of observed feature values and class labels. However, because labeling is costly, the amount of labeled training data is often small. In this thesis, we propose a semi-supervised learning scheme that improves classification performance even when the amount of labeled training data is small. The proposed scheme uses both a small amount of labeled data and unlabeled data, which are not usually used for learning. First, we design an HMM structure suited to using the unlabeled sequences. Then, we formally derive a semi-supervised learning algorithm whose convergence is theoretically guaranteed. Next, we apply the proposed method to two types of time series: sequences acquired from sign language data and sequences acquired from speech phoneme data. Experimental results show that the proposed method outperforms conventional methods.

The second application is IR, specifically cross-media IR, in which queries and the corresponding target data belong to different media. More specifically, this thesis focuses on situations where queries are represented as text and target data are images. When textual annotations explaining the contents of images are provided, such cross-media image retrieval can be regarded as ordinary textual IR. In text retrieval, associations between words can be learned from data and used to relate queries to the text in documents. However, in image retrieval, the annotations are too few to learn these word relationships. Using the fact that annotations and images are related, we regard images as data paired with the annotations.

∗Doctor’s Thesis, Department of Information Systems, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-DT0161006, March 24, 2004.
Similar resources
Designing collaborative learning model in online learning environments
Introduction: Most online learning environments are challenging for the design of collaborative learning activities to achieve high-level learning skills. Therefore, the purpose of this study was to design and validate a model for collaborative learning in online learning environments. Methods: The research method used in this study was a mixed method, including qualitative content analysis and...
GBC: Gradient boosting consensus model for heterogeneous data
With the rapid development of database technologies, multiple data sources may be available for a given learning task (e.g., collaborative filtering). However, the data sources may contain different types of features. For example, users’ profiles can be used to build recommendation systems. In addition, a model can also use users’ historical behaviors and social networks to infer users’ interes...
Studying Workplace Learning: Case Study
The present research aims to study workplace learning in the Petrochemical Industries National Company. The research is qualitative in nature and uses the grounded theory approach introduced by Charmaz. Data collection methods included interviews, document analysis, observation, and focus groups. The statistical population of the research consisted of 350 employees, 16 of whom were selected thr...
Ratio Rule Mining from Multiple Data Sources
Both multiple source data mining and streaming data mining problems have attracted much attention in the past decade. In contrast to traditional association-rule mining, to capture the quantitative association knowledge, a new paradigm called Ratio Rule (RR) was proposed recently. We extend this framework to mining ratio rules from multiple source data streams which is a novel and challenging p...
Learning from Semantically Heterogeneous Data
Advances in the Semantic Web technologies present unprecedented opportunities for exploiting multiple related data sources to discover useful knowledge in many application domains. We have precisely formulated the problem of learning classifiers from a collection of several related ontology extended data sources, which make explicit (the typically implicit) ontologies associated with the data s...
Learning with Multiple Similarities
Title of dissertation: LEARNING WITH MULTIPLE SIMILARITIES. Abhishek Kumar, Doctor of Philosophy, 2013. Dissertation directed by: Professor Hal Daumé III, Department of Computer Science. The notion of similarities between data points is central to many classification and clustering algorithms. We often encounter situations where there is more than one set of pairwise similarity graphs between objects...
Publication date: 2004